import warnings
warnings.filterwarnings('ignore')
In an increasingly interconnected and data-rich world, networks are a ubiquitous feature of modern life. They appear in a wide variety of fields such as social networks, global value chains, disease outbreaks, mobile phone networks, internet browsing, vehicular flows, transportation, and finance, among others. Network analysis may allow us to:
For instance, the image below shows political blogs prior to the 2004 US Presidential election and reveals two densely-knit and well-separated communities.
(Source: Easley & Kleinberg, 2010)
The key elements of a graph network are nodes and edges:
To illustrate these concepts, a simple network graph as well as sample node and edge data tables follow:
(Adapted from Professor Taylor Corbett's Data Visualization Lecture 9 for the Fall 2021 Semester)
NetworkX is a Python package for creating, manipulating, visualizing, and studying the structure, dynamics, and functions of complex networks. It can handle graphs with up to 10 million nodes and around 100 million edges.
NetworkX allows users to "load and store networks in standard and nonstandard data formats, generate many types of random and classic networks, analyze network structure, build network models, design new network algorithms, draw networks, and much more".
While NetworkX is one of the most popular Python packages for creating and manipulating graphs and networks, its primary goal is to facilitate graph analysis rather than perform graph visualization. However, the package does include basic drawing functionalities using Matplotlib. Hence, plotting within NetworkX would be more appropriate for simpler networks or for exploratory data analysis. For more advanced graph visualizations, network data within NetworkX may be exported and fed into fully-featured graph visualization tools such as the open source software package Graphviz.
NetworkX requires Python 3.7 or newer.
To install the latest release of the package, run pip install networkx[default].
To install the package without the dependencies (e.g., numpy, scipy), run pip install networkx.
Alternatively, manual downloads are also possible through NetworkX's GitHub or PyPI repositories.
Graph creation consists of five basic steps:
1. import networkx as nx
2. g = nx.Graph()
3. g.add_node(node)
4. g.add_edge(node_1, node_2)
5. nx.draw(g)
Note: Step 3 may be skipped, as nodes are created automatically when edges are added between nodes that do not yet exist.
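The note above can be checked directly. The following minimal sketch (the variable name `g` and the node labels are purely illustrative) adds a single edge to an empty graph and confirms that both endpoints are created as nodes without any call to add_node.

```python
import networkx as nx

# Start from an empty undirected graph
g = nx.Graph()

# Add an edge between two nodes that do not exist yet
g.add_edge("a", "b")

# Both endpoints now exist as nodes, even though add_node was never called
print(list(g.nodes()))
print(list(g.edges()))
```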
# import relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
# Create a networkx graph object
my_graph = nx.Graph()
# Add edges to the graph object
# Each tuple represents an edge between two nodes
my_graph.add_edges_from([
(1,2),
(1,3),
(3,4),
(1,5),
(3,5),
(4,2),
(2,3),
(3,0)])
# Draw the resulting graph
nx.draw(my_graph, with_labels=True, font_weight='bold')
We could also import network data from a dataframe by calling the from_pandas_edgelist function, specifying the source dataframe followed by the columns containing the linked nodes. For directional links, the source node column must be specified before the target node column.
The draw function also allows for graph customization such as node labeling (with_labels), node size (node_size), transparency (alpha), and node border width (linewidths). Edge width is controlled separately through the width argument.
# create a dataframe
df = pd.DataFrame({'from': ['A', 'B', 'C', 'A'],
'to': ['D', 'A', 'E', 'C']})
# create graph object
G = nx.from_pandas_edgelist(df, 'from', 'to')
# plot the network graph
nx.draw(G, with_labels=False, node_size=500, alpha=1, linewidths=20)
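By default, from_pandas_edgelist builds an undirected graph, so the column order alone does not preserve direction. As a small sketch of the directional case, the same toy dataframe can be passed with create_using=nx.DiGraph (the name `DG` is just an illustrative choice):

```python
import pandas as pd
import networkx as nx

# Same toy edge list as above: each row links a 'from' node to a 'to' node
df = pd.DataFrame({"from": ["A", "B", "C", "A"],
                   "to": ["D", "A", "E", "C"]})

# create_using=nx.DiGraph keeps the direction of each row (source -> target)
DG = nx.from_pandas_edgelist(df, source="from", target="to",
                             create_using=nx.DiGraph)

# The edge exists only in the direction given by the dataframe row
print(DG.has_edge("A", "D"))
print(DG.has_edge("D", "A"))
```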
We could likewise specify node positions (pos), which may come in especially handy to avoid overlapping nodes in more complex networks. There are various options available to tweak the different graph elements. Matplotlib features may also be used alongside NetworkX to adjust the formatting of the graphs.
# create nodes and edges
G = nx.Graph()
G.add_edge(1, 2)
G.add_edge(1, 3)
G.add_edge(1, 5)
G.add_edge(2, 3)
G.add_edge(3, 4)
G.add_edge(4, 5)
# set positions
pos = {1: (0, 0), 2: (-1, 0.3), 3: (2, 0.17), 4: (4, 0.255), 5: (5, 0.03)}
options = {
"font_size": 36,
"node_size": 3000,
"node_color": "white",
"edgecolors": "black",
"linewidths": 5,
"width": 5}
nx.draw_networkx(G, pos, **options)
# Set margins for the axes so that nodes aren't clipped
ax = plt.gca()
ax.margins(0.20)
plt.axis("off")
plt.show()
To create a directed graph, we can use the DiGraph class. Again, within each edge, the source node must be specified before the target node when creating the graph object.
# create graph object
G = nx.DiGraph([(0, 3), (1, 3), (2, 4), (3, 5), (3, 6), (4, 6), (5, 6)])
# group nodes by column
left_nodes = [0, 1, 2]
middle_nodes = [3, 4]
right_nodes = [5, 6]
# set the position according to column (x-coord)
pos = {n: (0, i) for i, n in enumerate(left_nodes)}
pos.update({n: (1, i + 0.5) for i, n in enumerate(middle_nodes)})
pos.update({n: (2, i + 0.5) for i, n in enumerate(right_nodes)})
nx.draw_networkx(G, pos, **options)
# Set margins for the axes so that nodes aren't clipped
ax = plt.gca()
ax.margins(0.20)
plt.axis("off")
plt.show()
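Because edge direction is stored in the graph object, it can also be queried without drawing anything. The sketch below rebuilds the same directed graph (under the illustrative name `DG`) and inspects the neighbors of a few nodes in each direction.

```python
import networkx as nx

# Same directed edges as in the plot above: (source, target)
DG = nx.DiGraph([(0, 3), (1, 3), (2, 4), (3, 5), (3, 6), (4, 6), (5, 6)])

# Nodes with an edge pointing into node 3
print(sorted(DG.predecessors(3)))

# Nodes that node 3 points to
print(sorted(DG.successors(3)))

# Number of incoming and outgoing edges for node 6
print(DG.in_degree(6), DG.out_degree(6))
```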
After a survey of NetworkX's basic plotting functionalities, let us try to analyze and visualize a sample social network through Marvel Cinematic Universe Social Network data sourced from Tableau.
# import data
msn = pd.read_csv("https://github.com/mkbunyi/Data-Viz-Tutorial-NetworkX/raw/main/marvel_social_network.csv")
msn
A cursory inspection of the Marvel dataset shows that we must reformat the data from "long" format (where linked characters are listed in separate rows) to "wide" format (where one row corresponds to one link or Line ID) in order to properly feed the data into our NetworkX graph object.
# reformat to combine linked characters in 1 row
# collapse rows by line ID and combine linked characters in a list per cell
msn_nx = msn.groupby('Line ID').agg(lambda x: x.tolist())
msn_nx = msn_nx[["Character Name","Character ID"]]
# split list and allocate separate columns for the linked characters
msn_nx = pd.concat([msn_nx["Character Name"].apply(pd.Series),
msn_nx["Character ID"].apply(pd.Series)],
axis=1).reset_index()
# add relation type
msn_nx = msn_nx.merge(msn[["Line ID","Relation","Relation Sentiment"]],
on="Line ID", how = "left").drop_duplicates()
# rename columns
msn_nx.columns = ['Line ID', 'Char1_Name', 'Char2_Name', 'Char1_ID', 'Char2_ID', 'Relation', 'Relation Sentiment']
# reset index
msn_nx = msn_nx.reset_index()
# view reformatted data
msn_nx
The dataframe is now ready for use in a NetworkX graph object. We just add one last variable to specify colors corresponding to the Relation Sentiment per relationship, which we can subsequently reference when we set edge colors.
# set color according to relation sentiment
msn_nx['color'] = np.where(msn_nx['Relation Sentiment']=="Positive",
"green",
"black")
msn_nx['color'] = np.where(msn_nx['Relation Sentiment']=="Negative",
"red",
msn_nx['color'])
We can now manipulate the data in NetworkX.
First, we initialize the graph object by specifying our source dataframe, nodes, and edge attributes (edge_attr).
# Initialize a graph object
G = nx.from_pandas_edgelist(msn_nx,
'Char1_Name',
'Char2_Name',
edge_attr=["Relation","Relation Sentiment"])
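To see what edge_attr actually stores, here is a small sketch using placeholder rows. The character names and relations below are invented stand-ins, not rows from the Marvel dataset, and `toy` and `toy_graph` are illustrative names.

```python
import pandas as pd
import networkx as nx

# Placeholder rows standing in for the reformatted data (not real dataset rows)
toy = pd.DataFrame({
    "Char1_Name": ["Hero A", "Hero A"],
    "Char2_Name": ["Hero B", "Villain C"],
    "Relation": ["Teammate", "Enemy"],
    "Relation Sentiment": ["Positive", "Negative"],
})

toy_graph = nx.from_pandas_edgelist(
    toy, "Char1_Name", "Char2_Name",
    edge_attr=["Relation", "Relation Sentiment"])

# Each edge carries a dictionary holding the requested columns
for u, v, attrs in toy_graph.edges(data=True):
    print(u, v, attrs)
```

The attributes of a single edge can then be read back with an expression such as toy_graph["Hero A"]["Villain C"]["Relation Sentiment"].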
We can choose from various layouts for the graph visualization, such as 'bipartite_layout', 'circular_layout', 'kamada_kawai_layout', 'random_layout', 'rescale_layout', 'shell_layout', 'spring_layout', 'spectral_layout', and 'fruchterman_reingold_layout'. For this example, we shall use 'kamada_kawai_layout', which positions nodes using a path-length cost function. I settled on this layout because it produced less node overlap than the other configurations. The NetworkX documentation describes the methods behind the different node positioning algorithms for graph drawing.
# Generate layout for visualization
pos = nx.kamada_kawai_layout(G)
Based on our chosen layout's algorithm, node positions will be generated automatically. However, we can also perform manual tweaking to address node overlaps in the visualization, among others. To highlight the effect, we will nudge Captain America to the bottom of the graph.
It is also possible to dispense with NetworkX's built-in layouts and specify user-determined coordinates for all nodes.
# Manual position tweaking
pos["Captain America"] += (0, -1)
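The tweak above works because a layout function returns a dictionary that maps each node to a NumPy coordinate array, so adding a tuple shifts the point in place. The sketch below illustrates this on a small path graph. It uses circular_layout as a stand-in only to keep the example lightweight, and the graph and names are illustrative assumptions.

```python
import networkx as nx

# A small stand-in graph; any layout function returns {node: array([x, y])}
toy_graph = nx.path_graph(4)
pos = nx.circular_layout(toy_graph)

# Keep a copy of the original coordinates of node 0 for comparison
before = pos[0].copy()

# Shift node 0 down by one unit; x is unchanged, y decreases by 1
pos[0] += (0, -1)

print(before, pos[0])
```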
We can also customize node size. In this example, we will set node size proportional to the number of links that the Marvel character has.
# node size is proportional to number of links
links = dict.fromkeys(G.nodes(), 0.0)
for (node1, node2, attrib) in G.edges(data=True):
    links[node1] += 1
    links[node2] += 1
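The number of links per node is what graph theory calls the node's degree, and NetworkX exposes it directly. As a quick check on a small illustrative graph (not the Marvel data), the manual count and the built-in degree view agree:

```python
import networkx as nx

# Small illustrative graph
toy_graph = nx.Graph([(1, 2), (1, 3), (1, 5), (2, 3)])

# Manual count, following the same approach as the loop above
manual = dict.fromkeys(toy_graph.nodes(), 0)
for node1, node2 in toy_graph.edges():
    manual[node1] += 1
    manual[node2] += 1

# Built-in equivalent: the degree view converted to a dictionary
builtin = dict(toy_graph.degree())

print(manual == builtin)
```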
Aside: As mentioned, the graph object also stores additional information on the nodes and edges, which may be used by outside functions (in this case, to count the number of links an entity possesses). To help picture the data structures of these nodes and edges, here is a sample view of our current graph object's nodes.
G.nodes()
We are now ready to fix the plot elements (through matplotlib) and visualize graph components.
fig, ax = plt.subplots(figsize=(40, 40))
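# draw edges, coloring each one by the sentiment stored on the edge itself
# (G.edges() is not guaranteed to follow the dataframe's row order)
sentiment_colors = {"Positive": "green", "Negative": "red"}
nx.draw_networkx_edges(G, pos, alpha=1, width=5,
                       edge_color=[sentiment_colors.get(attrs["Relation Sentiment"], "black")
                                   for _, _, attrs in G.edges(data=True)])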
# draw nodes, sized by number of links
nx.draw_networkx_nodes(G, pos,
                       node_size=[links[i]*500 for i in G],
                       node_color="blue", alpha=1)
# draw labels
label_options = {"ec": "black", "fc": "white", "alpha": .9}
nx.draw_networkx_labels(G, pos, font_size=30, bbox=label_options)
# display title
font = {"color": "black", "fontweight": "bold", "fontsize": 40}
ax.set_title("Marvel Social Network", font)
# Resize figure for label readability
ax.margins(0.1, 0.05)
fig.tight_layout()
plt.axis("off")
plt.show()
As mentioned, NetworkX may not be suitable for production-quality visualizations, especially for complex networks. However, the graph above still offers a few insights into the relationships in our sample social network: connections are a mix of positive and negative, leaning positive, with only a few neutral ones. We also see larger nodes for characters with more extensive connections, such as Captain America, Black Widow, and Hulk. Subsetting for well-connected or less-connected characters can also produce visualizations that are easier to read. For instance, we can use NetworkX's graph analysis techniques to check which character possesses the most links and then generate a graph visualization for that character.
To start, we can view sample relationships for our dataset's first character, Abomination:
# view Abomination's links and the nature of these links
G['Abomination']
# count the number of connections
len(G['Abomination'])
Using these graph analysis functions, we can check which character has the most connections and then zoom in our analysis on that character.
# initialize
top_links = {}
# iterate through nodes to count connections
for char in G.nodes:
    top_links[char] = len(G[char])
# convert to dataframe
s = pd.Series(top_links, name='connections')
df = s.to_frame().sort_values('connections', ascending=False)
df
We can then subset our data and visualize the connections of the character at the top of our list: Black Widow.
msn_blackwidow = msn_nx[msn_nx["Char1_Name"]=="Black Widow"].reset_index()
msn_blackwidow
We can now visualize Black Widow's network like before. However, this time we will try to use the spring layout which will put Black Widow, our main entity of interest, in the middle of the graph.
# initialize plot
fig, ax = plt.subplots(figsize=(10, 8))
# Initialize a graph object
G = nx.from_pandas_edgelist(msn_blackwidow,
'Char1_Name',
'Char2_Name',
edge_attr=["Relation","Relation Sentiment"])
# Draw using a spring layout
nx.draw_spring(G,with_labels=True,
edge_color=[msn_blackwidow["color"][i] for i in list(range(len(msn_blackwidow)))],
node_color = "gainsboro",
node_size = 2000,
font_size=11)
# Resize figure for label readability
fig.tight_layout()
plt.axis("off")
plt.margins(x=0.4)
plt.show()
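As an alternative to subsetting the dataframe, NetworkX can extract the neighborhood of a node directly from an existing graph with ego_graph. This is not the approach used above, only a sketch of another option, shown on a small illustrative graph with made-up node names.

```python
import networkx as nx

# Illustrative graph: "Hub" is linked to A and B; C and D are further away
toy_graph = nx.Graph([("Hub", "A"), ("B", "Hub"), ("A", "C"), ("C", "D")])

# Keep only the center node and everything within one step of it
ego = nx.ego_graph(toy_graph, "Hub", radius=1)

print(sorted(ego.nodes()))
print(ego.number_of_edges())
```

Note that ego_graph picks up a neighbor regardless of which endpoint column it came from, and it also keeps any edges that exist among the neighbors themselves.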
This simple demonstration barely scratches the surface of the possibilities within NetworkX and the use cases for network analysis. One possible application is determining connected components, where we would like to identify distinct groups within our dataset.
Connected components
(Adapted from Rahul Agarwal, Towards Data Science (August 2019))
Real-world scenarios where the connected components algorithm could be potentially useful are:
# load in data
cities = pd.read_csv("https://github.com/mkbunyi/Data-Viz-Tutorial-NetworkX/raw/main/distances.csv")
cities.head()
# create graph object along with nodes and edges
g = nx.Graph()
for edge in range(len(cities)):
    g.add_edge(cities["node1"][edge],
               cities["node2"][edge],
               weight=cities["distance"][edge])
NetworkX has a rich repertoire of graph analysis tools that don't require visualization. We can easily identify distinct groups of connected nodes by passing our graph object to the connected_components function:
for i, x in enumerate(nx.connected_components(g)):
    print("cc"+str(i)+":", x)
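To make the output format concrete, here is a minimal sketch on a toy graph with made-up node labels. connected_components yields one set of nodes per group, and an isolated node forms a component of its own.

```python
import networkx as nx

# Toy graph: one chain of three nodes, one pair, and one isolated node
toy = nx.Graph()
toy.add_edges_from([("A", "B"), ("B", "C"), ("X", "Y")])
toy.add_node("Z")

# Each component is returned as a set of node labels; sort largest first
components = sorted(nx.connected_components(toy), key=len, reverse=True)

for i, comp in enumerate(components):
    print("cc" + str(i) + ":", sorted(comp))
```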
We could also plot the graph and eyeball the results. This time, we will use the 'spring layout', where we can set k as the optimal distance between nodes. The higher the value for k, the larger the distance between nodes.
# set layout
pos = nx.spring_layout(g, k=5, seed=10)
# plot the network
nx.draw(g,pos,
with_labels = True, #labels nodes
node_color='lightsteelblue',
edge_color='slategrey')
# label edges (distance between cities)
edge_labels = nx.get_edge_attributes(g,'weight')
nx.draw_networkx_edge_labels(g,pos,edge_labels=edge_labels,
font_size=9,rotate=False)
plt.show()
(Adapted from Rahul Agarwal, Towards Data Science (August 2019) and the developer guide of open source graph database Neo4j)
Network analysis is also helpful in determining the shortest distance between two points. For instance, continuing the city distances example, we can compute the shortest distance from Frankfurt to Stuttgart and the path that we need to traverse to cover this distance. Similar algorithms are used in tools such as Google Maps, grocery shopping, and even computation of LinkedIn connections.
print(nx.shortest_path_length(g, 'Frankfurt','Stuttgart',weight='weight'))
print(nx.shortest_path(g, 'Frankfurt','Stuttgart',weight='weight'))
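The weight argument matters: without it, NetworkX counts hops rather than summing distances. The sketch below uses a toy graph with invented node labels and distances (not the cities dataset) to show the two answers diverging.

```python
import networkx as nx

# Toy graph with made-up distances stored in the 'weight' attribute
toy = nx.Graph()
toy.add_weighted_edges_from([("P", "Q", 4), ("Q", "R", 1),
                             ("P", "R", 7), ("R", "S", 2)])

# Weighted: minimizes total distance (P-Q-R-S costs 4 + 1 + 2 = 7)
weighted_path = nx.shortest_path(toy, "P", "S", weight="weight")
weighted_length = nx.shortest_path_length(toy, "P", "S", weight="weight")

# Unweighted: minimizes the number of edges (P-R-S uses only 2 edges)
hop_path = nx.shortest_path(toy, "P", "S")

print(weighted_path, weighted_length)
print(hop_path)
```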
The PageRank algorithm measures node importance based on the number and quality of its (incoming and outgoing) links. Its use cases include:
Below is a sample graph showing PageRank scores within a Facebook user network. The PageRank algorithm gives a higher score to a user whose many friends themselves have extensive friend lists. The most influential user in this network is marked by a yellow dot.
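To give a feel for how such scores arise, here is a deliberately simplified power-iteration sketch of the PageRank idea. It is not NetworkX's implementation (the library provides nx.pagerank for real use). The function name `pagerank_sketch`, the damping value of 0.85, the iteration count, and the toy graph are all assumptions for illustration, and the sketch assumes every node has at least one outgoing link.

```python
import networkx as nx

def pagerank_sketch(graph, damping=0.85, iterations=100):
    """Simplified PageRank; assumes every node has at least one outgoing link."""
    n = graph.number_of_nodes()
    # Start with the score spread evenly over all nodes
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        new_rank = {}
        for node in graph:
            # Each source shares its current score equally among its outgoing links
            incoming = sum(rank[src] / graph.out_degree(src)
                           for src in graph.predecessors(node))
            new_rank[node] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank

# Toy directed graph: A, B and D all link to C, and C links back to A
web = nx.DiGraph([("A", "C"), ("B", "C"), ("D", "C"), ("C", "A")])
scores = pagerank_sketch(web)

# Print nodes from highest to lowest score
for node, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(node, round(score, 3))
```

In this toy graph, C collects the most links and scores highest, while A scores well mainly because its single incoming link comes from the high-scoring C.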
NetworkX covers various centrality measures, some of which are as follows:
Below is a sample graph showing betweenness centrality within a Facebook user network. Larger nodes represent users with higher influence who pass information between different groups.
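The difference between centrality measures is easiest to see on a small example. In the sketch below, an illustrative graph made of two triangles joined through a single bridge node, the bridge has the fewest links yet the highest betweenness, because every shortest path between the two groups passes through it.

```python
import networkx as nx

# Two triangles (1-2-3 and 5-6-7) joined through bridge node 4
toy = nx.Graph([(1, 2), (2, 3), (1, 3),
                (3, 4), (4, 5),
                (5, 6), (6, 7), (5, 7)])

# Degree centrality: share of other nodes each node is directly linked to
degree_c = nx.degree_centrality(toy)

# Betweenness centrality: share of shortest paths that pass through each node
between_c = nx.betweenness_centrality(toy)

for node in toy:
    print(node, round(degree_c[node], 2), round(between_c[node], 2))
```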